(54) Method for multistage speech recognition using confidence measures



(57) In a method for recognizing speech commands, in which a group of
command words selectable by speech commands is defined, a time window is
defined, within which the recognition of the speech command is performed.
In a first recognition stage of the method, a recognition result is
selected, for which a first confidence value is determined. Further in the
method, a first threshold value (Y) is determined, with which said first
confidence value is compared. If said first confidence value is greater
than or equal to said first threshold value (Y), the recognition result of
the first recognition stage is selected as the recognition result of the
speech command. If said first confidence value is smaller than said first
threshold value (Y), a second recognition stage is performed for the speech
command, wherein said time window is extended, and a recognition result is
selected for the second recognition stage. A second confidence value is
determined for the recognition result of the second recognition stage and
compared with said first threshold value (Y). If said second confidence
value is greater than or equal to said first threshold value (Y), the
command word selected at the second stage is selected as the recognition
result for the speech command. If said second confidence value is smaller
than said first threshold value (Y), a comparison stage is performed,
wherein the first and second recognition results are compared to find out
at which probability they are substantially the same, wherein if the
probability exceeds a predetermined value, the command word selected at the
second stage is selected as the recognition result for the speech command.
Description

[0001] The present invention relates to a method in the recognition of
speech as set forth in the preamble of the appended claim 1, a speech
recognition device as set forth in the preamble of the appended claim 7,
and a wireless communication device to be controlled by speech, as set
forth in the preamble of the appended claim 9.

[0002] For facilitating the use of wireless communication devices,
so-called hands-free devices have been developed, whereby the wireless
communication device can be controlled by speech. Thus, speech can be used
to control different functions of the wireless communication device, such
as turning on/off, transmission/reception, control of sound volume,
selection of telephone number, or answering a call, whereby, particularly
in use in a vehicle, it is easier for the user to concentrate on the
driving.

[0003] One drawback in a wireless communication device controlled by
speech is that speech recognition is not fully faultless. In a car, the
background noise caused by the environment has a high volume, thereby
making it difficult to recognize speech. Due to the unreliability of the
speech recognition, users of wireless communication devices have so far
shown relatively little interest in control by speech. The recognition
capability of present speech recognizers is not particularly good,
especially under difficult conditions, such as in a moving car, where the
high volume of background noise substantially hampers reliable recognition
of words. Incorrect recognition decisions usually cause most problems in
the implementation of the user interface, because incorrect recognition
decisions may start undesired functions, such as terminating a call in
progress, which is naturally particularly disturbing to the user. One
result of an incorrect recognition decision may be that a call is connected
to an incorrect number. For this reason, the user interface is designed in
such a way that the user is usually asked to repeat a command if the speech
recognizer does not have sufficient certainty of a word uttered by the
user.

[0004] Almost all speech recognition devices are based on the functional
principle that a word uttered by the user is compared, by a usually rather
complicated method, with a group of reference words previously stored in
the memory of the speech recognition device. Speech recognition devices
usually calculate a figure for each reference word to describe how much the
word uttered by the user resembles the reference word. The recognition
decision is finally made on the basis of these figures so that the decision
is to select the reference word which the uttered word resembles most. The
best known methods in the comparison between the uttered word and the
reference words are dynamic time warping (DTW) and the statistical hidden
Markov model (HMM).

[0005] In both the DTW and the HMM methods, an unknown speech pattern is
compared with known reference patterns. In dynamic time warping, the speech
pattern is divided into several frames, and the local distance between the
speech pattern included in each frame and the corresponding speech segment
of the reference pattern is calculated. This distance is calculated by
comparing the speech segment and the corresponding speech segment of the
reference pattern with each other, and it is thus a kind of numerical value
for the differences found in the comparison. For speech segments close to
each other, a smaller distance is usually obtained than for speech segments
further from each other. On the basis of local distances obtained this way,
a minimum path between the beginning and end points of the word is sought
by using a DTW algorithm. Thus, by dynamic time warping, a distance is
obtained between the uttered word and the reference word. In the HMM
method, speech patterns are produced, and this stage of speech pattern
generation is modelled with a state change model according to the Markov
method. The state change model in question is thus the HMM. In this case,
speech recognition on received speech patterns is performed by defining an
observation probability on the speech patterns according to the hidden
Markov model. In speech recognition by using the HMM method, an HMM model
is first formed for each word to be recognized, i.e. for each reference
word. These HMM models are stored in the memory of the speech recognition
device. When the speech recognition device receives the speech pattern, an
observation probability is calculated for each HMM model in the memory, and
as the recognition result, a counterpart word is obtained for the HMM model
with the greatest observation probability. Thus for each reference word the
probability is calculated that it is the word uttered by the user. The
above-mentioned greatest observation probability describes the resemblance
of the received speech pattern and the closest HMM model, i.e. the closest
reference speech pattern.
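
By way of illustration only, the HMM-based selection described above can be
sketched as follows. The sketch assumes discrete observation symbols and a
small hand-written two-state model per reference word; the function names
`observation_probability` and `recognize`, the model parameters and the
symbol sequence are hypothetical and are not taken from the description.

```python
def observation_probability(observations, start, trans, emit):
    """Forward algorithm: probability that the HMM produced the observations."""
    n_states = len(start)
    # Initialise with the start distribution and the first emitted symbol.
    alpha = [start[s] * emit[s][observations[0]] for s in range(n_states)]
    for symbol in observations[1:]:
        # Propagate through the state change model, then weight by emission.
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(observations, models):
    """Return (word, probability) of the model with the greatest probability."""
    scores = {
        word: observation_probability(observations, *model)
        for word, model in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical models: (start probabilities, transition matrix, emission matrix)
# over two observation symbols, 0 and 1.
MODELS = {
    "yes": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

word, probability = recognize([0, 1, 1], MODELS)
```

In this example the "yes" model explains the symbol sequence better than the
"no" model and is therefore returned as the recognition result.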

[0006] Thus, in present systems the speech recognition device calculates a
certain figure for the reference words on the basis of the word uttered by
the user. In the DTW method, the figure is the distance between the words,
and in the HMM method, the figure is the probability for the equality of
the uttered word and the HMM model. When the HMM method is used, the speech
recognition devices are usually set a certain threshold probability which
the most probable reference word must achieve to make the recognition
decision. Another factor affecting the recognition decision can be e.g. the
difference between the probabilities of the most probable and the second
most probable word, which must be sufficiently great to make the
recognition decision. Thus, it is possible that when the background noise
has a high volume, on the basis of a command uttered by the user, the
reference word in the memory, e.g. the reference word "yes", obtains at
each attempt the greatest probability in relation to the other reference
words, e.g. the probability 0.8. If the threshold probability is for
example 0.9, the recognition is not accepted and the user may have to utter
the command several times until the recognition probability threshold is
exceeded and the speech recognition device accepts the command, even though
the probability may have been very close to the acceptable value. This is
very disturbing to the user.
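
The prior-art decision rule discussed above can be sketched, purely as an
illustration, as follows. The function name `prior_art_decision`, the
parameter `min_margin` and the probability of 0.15 for the word "no" are
hypothetical; only the values 0.8 and 0.9 come from the example above.

```python
def prior_art_decision(probabilities, threshold, min_margin=0.0):
    """Accept the most probable word only if it reaches the threshold and
    leads the second most probable word by at least min_margin."""
    ranked = sorted(probabilities.values(), reverse=True)
    best_word = max(probabilities, key=probabilities.get)
    best = ranked[0]
    second = ranked[1] if len(ranked) > 1 else 0.0
    if best >= threshold and best - second >= min_margin:
        return best_word
    return None  # no recognition decision; the user must repeat the command


scores = {"yes": 0.8, "no": 0.15}
rejected = prior_art_decision(scores, threshold=0.9)
accepted = prior_art_decision(scores, threshold=0.75, min_margin=0.3)
```

With the threshold 0.9 the word "yes" is rejected at every attempt although
it is clearly the most probable word, which is the drawback noted above.
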

[0007] Furthermore, the speech recognition is hampered by the fact that
different users utter the same words in different ways, wherein the speech
recognition device works better when used by one user than when used by
another user. In practice, it is very difficult with the presently known
techniques to adjust the certainty levels of speech recognition devices to
consider all users. When adjusting the required certainty level e.g. for
the word "yes" in speech recognition devices of prior art, the required
threshold is typically set according to so-called worst speakers. Thus, the
problem emerges that words close to the word "yes" also become incorrectly
accepted. The problem is aggravated by the fact that in some situations,
mere background noise may be recognized as command words. In speech
recognition devices of prior art, the aim is to find a suitable balance in
which a certain part of the users have great problems in having their words
accepted and the number of incorrectly accepted words is sufficiently
small. If the speech recognition device is adjusted in a way that a minimum
number of users have problems in having their words accepted, this means in
practice that the number of incorrectly accepted words will increase.
Correspondingly, if the aim is set at as faultless a recognition as
possible, an increasing number of users will have difficulties in having
commands uttered by them accepted.

[0008] In speech recognition, errors are generally classified in three
categories:

Insertion Error

The user says nothing but a command word is recognized in spite of this, or
the user says a word which is not a command word and still a command word
is recognized.

Deletion Error

The user says a command word but nothing is recognized.

Substitution Error

The command word uttered by the user is recognized as another command word.
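
As a non-limiting illustration, the three error categories can be expressed
as a small classification routine. The function name `classify_outcome`,
the label "correct rejection" for the error-free case in which nothing
should be and nothing is recognized, and the example vocabulary are
assumptions made for this sketch.

```python
def classify_outcome(spoken, recognized, vocabulary):
    """Classify one recognition attempt.

    spoken:     the word actually uttered, or None if the user said nothing
    recognized: the command word output by the recognizer, or None
    """
    spoke_command = spoken in vocabulary
    if recognized is None:
        # Nothing was recognized: an error only if a command word was said.
        return "deletion" if spoke_command else "correct rejection"
    if not spoke_command:
        # A command word was recognized although none was uttered.
        return "insertion"
    return "correct" if recognized == spoken else "substitution"


VOCABULARY = {"yes", "no"}
outcomes = [
    classify_outcome(None, "yes", VOCABULARY),     # silence recognized as "yes"
    classify_outcome("maybe", "yes", VOCABULARY),  # non-command word recognized
    classify_outcome("yes", None, VOCABULARY),     # command word missed
    classify_outcome("yes", "no", VOCABULARY),     # wrong command word
]
```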

[0009] In a theoretical optimum solution, the speech recognition device
makes none of the above-mentioned errors. However, in practical situations,
as was already presented above, the speech recognition device makes errors
of all the said types. For usability of the user interface, it is important
to design the speech recognition device in a way that the relative shares
of the different error types are optimal. For example in speech activation,
where a speech-activated device waits even for hours for a certain
activation word, it is important that the device is not erroneously
activated at random. Furthermore, it is important that the command words
uttered by the user are recognized at good accuracy. In this case, however,
it is more important that no erroneous activations take place. In practice,
this means that the user must repeat the uttered command word more often so
that it would be recognized correctly at a sufficient probability.

[0010] In the recognition of a numerical sequence, almost all errors are
equally significant. Any error in the recognition of the numbers in a
sequence results in a false numerical sequence. Also the situation that the
user says nothing and still a number is recognized, is inconvenient for the
user. However, a situation in which the user utters a number indistinctly
and the number is not recognized, can be corrected by the user by uttering
the numbers more distinctly.

[0011] The recognition of a single command word is presently a very typical
function implemented by speech recognition. For example, the speech
recognition device may ask the user: "Do you want to receive a call?", to
which the user is expected to reply either "yes" or "no". In such
situations where there are very few alternative command words, the command
words are often recognized correctly, if at all. In other words, the number
of substitution errors in such a situation is very small. The greatest
problem in the recognition of single command words is that an uttered
command is not recognized at all, or an irrelevant word is recognized as a
command word. In the following, there are three different alternative
situations of this example:

1) A speech-controlled device asks the user: "Do you want to receive a
call?", to which the user replies indistinctly: "Yes ... ye-". The device
does not recognize the user's reply and asks the user again: "Do you want
to receive a call? Say yes or no." Thus the user may be easily frustrated,
if the device often asks the user to repeat the command word uttered.

2) The device asks the user again: "Do you want to receive a call?", to
which the user responds distinctly "yes". However, the device did not
recognize this for certain and wants a confirmation: "Did you say yes?", to
which the user replies again "yes". Even now, no reliable recognition was
made, so the device asks again: "Did you say yes?". The user must repeat
again the reply "yes", for the device to complete the recognition.

3) Still in a third example situation, the speech-controlled device asks
the user, if s/he wants to receive a call. To this, the user mumbles
something vague, and in spite of this, the device interprets the user's
utterance as the command word "yes" and informs the user "All right, the
call will be connected". Thus, in this situation, the interpretation of the
device of the user's vague speech is closer to the word "yes" than to the
word "no". Consequently, in this situation, words that resemble a command
word begin to be incorrectly accepted.

[0012] In speech recognition methods according to prior art, it is typical
to use in the recognition of the command word a time window of fixed
length, during which the user must utter the command word. In another
speech recognition method of prior art, the recognition probability is
calculated for the command word uttered by the user, and if this
probability does not exceed a predetermined threshold value, the user is
requested to utter the command word again, after which a new calculation of
the recognition probability is performed by utilizing the probability
calculated in the previous recognition time. The recognition decision is
made, if the threshold probability is achieved considering the previous
probabilities. In this method, however, the utilization of repetition will
easily result in an increase in the chance of the above-mentioned insertion
error, wherein upon repeating a word outside the vocabulary, it is more
easily recognized as a command word.

[0013] It is an aim of the present invention to provide an improved speech
recognition method as well as a speech-controlled wireless communication
device in which speech recognition is more reliable than in prior art. The
invention is based on the idea that the recognition probability calculated
for an uttered command word is compared with the probability of background
noise, wherein the confidence value thus obtained is used to deduce whether
the recognition was positive. If the confidence value remains below a
determined threshold for a positive recognition, the time window used in
the recognition is extended and a new recognition is performed for the
repeated utterance of the command word. If the repeated command word is not
recognized at a sufficient confidence value, a comparison between the
command words uttered by the user is still performed; thus, in case the
recognitions of the words uttered by the user indicate that the user has
uttered the same command word two times in succession, the recognition is
accepted. The method according to the present invention is primarily
characterized in what will be presented in the characterizing part of the
appended claim 1. The speech recognition device according to the present
invention is primarily characterized in what will be presented in the
characterizing part of the appended claim 7. Furthermore, the wireless
communication device according to the present invention is characterized in
what will be presented in the characterizing part of the appended claim 9.

[0014] The present invention provides significant advantages over speech
recognition methods and devices of prior art. With the method according to
the invention, a smaller probability of insertion errors is obtained than
is possible to achieve with methods of prior art. In the method of the
invention, when the recognition is not certain, the time for interpreting
the command word is extended, wherein the user has a possibility to repeat
the command word given. According to the invention, it is additionally
possible, if necessary, to effectively utilize the repetition of the
command word by making a comparison with the command word uttered earlier
by the user, which comparison improves the recognition of the command word
significantly. Thus, the number of incorrect recognitions can be
significantly reduced. Also the probability of such situations in which a
command word is recognized although the user did not utter a command word,
is reduced significantly. The method of the invention renders it possible
to use such confidence levels in which the number of incorrectly recognized
command words is minimal. Users who do not have their speech commands
easily accepted in solutions of prior art can, by repetition of the command
word according to the invention, significantly improve the probability of
acceptance of speech commands uttered.

[0015] In the following, the invention will be described in more detail
with reference to the appended drawings, in which

Fig. 1 shows recognition thresholds used in the method according to an
advantageous embodiment of the invention,

Fig. 2 shows a method according to an advantageous embodiment of the
invention in a state machine presentation,

Fig. 3 illustrates time warping of feature vectors,

Fig. 4 illustrates the comparison of two words in a histogram, and

Fig. 5 illustrates a wireless communication device according to an
advantageous embodiment of the invention in a reduced schematic diagram.

[0016] In the following, we will describe the function of the speech
recognition device according to an advantageous embodiment of the
invention. In the method, a command word uttered by the user is recognized
by calculating, in a way known as such, the probability for how close the
uttered word is to different command words. On the basis of these
probabilities calculated for the command words, the command word with the
greatest probability is advantageously selected. After this, the
probability calculated for the recognized word is compared with the
probability produced by a background noise model. The background noise
model represents general background noise and also all such words which are
not command words. Thus, a probability is calculated here for the
possibility that the recognized word is only background noise or a word
other than a command word. On the basis of this comparison, a first
confidence value is obtained, to indicate how positively the word was
recognized. Figure 1 illustrates the determination of the confidence of
this recognition by using threshold values and said confidence value. In
the method according to an advantageous embodiment of the invention, a
first threshold value is determined, indicated in the appended figure with
the reference Y. It is the limit determined for the confidence value that
the recognition is positive (the confidence value is greater than or equal
to the first threshold value Y). In the method according to a second
advantageous embodiment of the invention, also a second threshold value is
determined, indicated in the appended figure with the reference A. This
indicates whether the recognition was uncertain (the confidence value is
greater than or equal to the second threshold value A but smaller than the
first threshold value Y) or very uncertain (the confidence value is smaller
than the second threshold value A).
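
The use of the confidence value with the threshold values Y and A can be
sketched, by way of example only, as follows. The description states that
the confidence value results from comparing the probability of the selected
command word with the probability produced by the background noise model,
but does not fix the form of the comparison; the logarithmic ratio used in
`confidence_value`, the function names and the numeric threshold values are
assumptions of this sketch.

```python
import math


def confidence_value(p_command, p_background):
    """Assumed form of the comparison: log ratio of the two probabilities."""
    return math.log(p_command / p_background)


def classify_confidence(confidence, y, a):
    """Map a confidence value to the three regions delimited by Y and A."""
    if a > y:
        raise ValueError("second threshold A must not exceed first threshold Y")
    if confidence >= y:
        return "certain"
    if confidence >= a:
        return "uncertain"
    return "very uncertain"


# Hypothetical threshold values, chosen only for this example.
Y, A = 2.0, 0.5
first = classify_confidence(confidence_value(0.8, 0.1), Y, A)
second = classify_confidence(confidence_value(0.3, 0.1), Y, A)
third = classify_confidence(confidence_value(0.1, 0.1), Y, A)
```
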

[0017] In the state machine presentation of Fig. 2, state 1 represents the
recognition of a command word. At this stage of recognizing the command
word, probabilities are determined on the basis of the command word uttered
by the user for different command words in the vocabulary of the speech
recognition device. The command word with the greatest probability is
preliminarily selected as the command word corresponding to the speech
command uttered by the user. For the selected command word, said confidence
value is determined and compared with the first threshold value Y and the
second threshold value A to deduce whether the recognition was certain,
uncertain or very uncertain. If the confidence value is greater than or
equal to the first threshold value Y, the operation moves on to state 4 to
accept the recognition. If the confidence value remained smaller than the
second threshold value A, the operation moves on to state 5 to exit from
the command word recognition, i.e. to reject the recognition. If the
confidence value was greater than or equal to the second threshold value A
but smaller than the first threshold value Y, the recognition was
uncertain, and the operation moves on to state 2. Thus, the time window is
extended, i.e. the user will have more time to say the command word again.
The operation can move on to this state 2 also because of an incorrect
word, e.g. as a result of a word uttered very unclearly by the user or as a
result of incorrect recognition caused by background noise. In this state
2, the repetition of the command word is awaited for the duration of the
extended time window. If the user uttered a command word again in this time
window, the command word is recognized and the confidence value is
calculated, as presented above in connection with the state 1. If at this
stage the calculated confidence value indicates that the command word
uttered at this second stage is recognized with sufficient confidence, the
operation moves on to the state 4 and the recognition is accepted. For
example, in a situation when the user may have said something vague in the
state 1 but has uttered the correct command word clearly in the state 2,
the recognition can be made solely on the basis of this command word
uttered in the state 2. Thus, no comparison will be made between the first
and second command words uttered, because this would easily lead to a less
secure recognition decision.

[0018] Nevertheless, if the command word cannot be recognized with
sufficient confidence in the state 2, the operation moves on to state 3 for
comparison of the repeated command words. If this comparison indicates that
the command word repeated by the user was very close to the command word
first said by the user, i.e. the same word was probably uttered twice in
succession, the recognition is accepted and the operation moves on to the
state 4. However, if the comparison indicates that the user has probably
not said the same word twice, the operation moves on to the state 5 and the
recognition is rejected.
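
The transitions between the states 1 to 5 can be sketched, as a
non-limiting illustration, as follows. The names `State` and
`run_state_machine` and the representation of the inputs (confidence values
as numbers, the outcome of the comparison in state 3 as a Boolean, and
`None` for a missing repetition) are assumptions of this sketch; a missing
repetition within the extended time window is treated as a rejection.

```python
from enum import Enum


class State(Enum):
    FIRST_RECOGNITION = 1    # state 1: recognition of the command word
    EXTENDED_WINDOW = 2      # state 2: waiting for and recognizing a repetition
    COMPARE_REPETITIONS = 3  # state 3: comparison of the two utterances
    ACCEPT = 4               # state 4: recognition accepted
    REJECT = 5               # state 5: recognition rejected


def run_state_machine(first_conf, second_conf, repetitions_match, y, a):
    """Return the list of states visited for one speech command.

    second_conf is None if the user did not repeat the command word within
    the extended time window.
    """
    path = [State.FIRST_RECOGNITION]
    if first_conf >= y:                      # certain
        return path + [State.ACCEPT]
    if first_conf < a:                       # very uncertain
        return path + [State.REJECT]
    path.append(State.EXTENDED_WINDOW)       # uncertain: extend the window
    if second_conf is None:                  # no repetition in time
        return path + [State.REJECT]
    if second_conf >= y:                     # second utterance suffices alone
        return path + [State.ACCEPT]
    path.append(State.COMPARE_REPETITIONS)
    return path + [State.ACCEPT if repetitions_match else State.REJECT]


# Both recognitions uncertain, but the two utterances match each other.
visited = run_state_machine(1.0, 1.2, True, y=2.0, a=0.5)
```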

[0019] Consequently, in the method of the invention, when the first step
indicates an uncertain recognition, a second recognition is made preferably
by a recognition method known as such. If this second step provides no
sufficient certainty of the recognition, a comparison of the repetitions is
made advantageously in the following way. In the state 1, feature vectors
formed of the command word uttered by the user are stored in a speech
response memory 4 (Fig. 5). Such feature vectors are extracted from the
speech typically at intervals of ca. 10 ms, i.e. ca. 100 feature vectors
per second. Also in the state 2, feature vectors formed of the command word
uttered at this stage are stored in the speech response memory 4. After
this, the recognition moves on to the state 3, in which these feature
vectors stored in the memory are compared preferably by dynamic time
warping. Figure 3 illustrates this dynamic time warping of feature vectors
in a reduced manner. At the top of the figure are shown feature vectors
produced by the first recognition, indicated with the reference number V1,
and correspondingly, at the bottom of the figure are shown feature vectors
produced by the second recognition and indicated with the reference number
V2. In this example, the first word was longer than the second word, i.e.
the user has said the word faster at the second stage, or the words
involved are different. Thus, for the feature vectors of the shorter word,
in this case the second word, one or more corresponding feature vectors are
found from the longer word by time warping the feature vectors of the two
words in a way that they correspond to each other optimally. In this
example, these time warping results are indicated by broken lines in Fig.
3. The distance between the words is calculated e.g. as a Euclidean
distance between the warped feature vectors. If the calculated distance is
large, it can be assumed that the words in question are different. Figure 4
shows an example of this comparison as a histogram. The histogram includes
two different comparisons: the comparison between two identical words
(shown in solid lines) and a comparison between two different words (shown
in broken lines). The horizontal axis is the logarithmic value of the
calculated distance between the feature vectors, wherein a smaller value
represents a smaller distance, and the vertical axis is the histogram
value. Thus, the smaller the distances at which the high histogram values
occur, i.e. the smaller the calculated distance, the higher the probability
that the words to be compared are the same.
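
The comparison of the two stored utterances by dynamic time warping can be
sketched, by way of example only, as follows. The sketch uses a textbook
DTW recursion with the Euclidean distance as the local distance; the
normalisation by the combined length of the two sequences, the acceptance
limit `max_distance` and the short two-dimensional feature vectors are
assumptions of this sketch and are not specified in the description.

```python
import math


def euclidean(u, v):
    """Local distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(v1, v2):
    """Length-normalised cost of the minimum warping path between v1 and v2."""
    n, m = len(v1), len(v2)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(v1[i - 1], v2[j - 1])
            # One vector may correspond to several vectors of the other word.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)


def probably_same_word(v1, v2, max_distance):
    """Decide that the words are the same if the warped distance is small."""
    return dtw_distance(v1, v2) <= max_distance


# Hypothetical feature vectors: V2 is a faster utterance of the word in V1.
V1 = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
V2 = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
OTHER = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0]]

same = probably_same_word(V1, V2, max_distance=0.5)
different = probably_same_word(V1, OTHER, max_distance=0.5)
```

Here the faster repetition warps onto the first utterance at zero cost and
is accepted as the same word, whereas the unrelated sequence is not.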

[0020] Figure 5 shows a wireless communication device 1, such as a GSM
mobile phone, controlled by speech commands according to an advantageous
embodiment of the invention. Only the most essential blocks for
understanding the invention are presented in Fig. 5. A speech control unit
2 comprises preferably a speech recognition means 3, a speech response
memory 4, a central processing unit 5, a read-only memory 6, a random
access memory 7, a speech synthesizer 8, and interface means 9. A command
word can be entered e.g. with a microphone 10a in the wireless
communication device 1 or with a microphone 10b in a hands-free facility
17. Instructions and notices to the user can be produced e.g. with the
speech synthesizer 8 either via a speaker 11a integrated in the wireless
communication device 1 or via a speaker 11b in the hands-free facility 17.
The speech control unit 2 according to the invention can also be
implemented without the speech synthesizer 8, wherein instructions and
notices to the user are transmitted preferably in text format on a display
means 13 in the telecommunication device. Yet another possibility is to
transmit the instructions and notices as both audio and text messages to
the user. The speech response memory 4 or a part of the same can be
implemented also in connection with the random access memory 7, or it can
be integrated in a possible general memory space of the wireless
communication device.

[0021] In the following, the operation of the wireless communication device
according to the invention will be described further. For the speech
control to function, the speech control unit 2 must normally be taught all
the command words to be used. These have been taught preferably at the
stage of manufacturing the device e.g. in a way that patterns corresponding
to the command words are stored in the speech response memory 4.

[0022] At the stage of recognizing a command word, the command word uttered
by the user is converted by a microphone 10a, 10b to an electrical signal
and conveyed to the speech control unit 2. In the speech control unit 2,
the speech recognition means 3 converts the uttered command word into
feature vectors which are stored in the speech response memory 4. The
speech control unit 2 additionally calculates for each command word in the
vocabulary of the speech control unit 2 a probability to indicate how
probably the command word uttered by the user is a certain command word in
the vocabulary. After this, the speech control unit 2 examines which
command word in the vocabulary has the greatest probability value, wherein
this word is preliminarily selected as a recognized command word. The
calculated probability for this word is still compared with a probability
produced by the background noise pattern, to determine a confidence value.
This confidence value is compared by the speech control unit 2 with the
first threshold value Y stored in the memory of the speech control unit,
preferably a read-only memory 6. If the comparison indicates that the
confidence value is greater than or equal to the first threshold value Y,
the speech control unit 2 deduces that the command word uttered by the user
was the word with the greatest probability value. The speech control unit 2
converts this command word into a corresponding control signal, e.g. the
signal for pressing the button corresponding to the command word, which is
transmitted via the interface 9 to the control block 16 of the wireless
communication device, or the like. The control block 16 interprets the
command and executes it as required, in a manner known per se. For example,
the user can enter a desired telephone number by saying it or by pressing
the corresponding buttons.

[0023] In case the comparison above indicated that the confidence value is
smaller than the first threshold value Y, a second comparison is made to
the second threshold value A. If the comparison indicates that the
confidence value is greater than or equal to the second threshold value A,
the speech control unit 2 extends the time limit determined for recognizing
the command word and waits to see whether the user will utter the command
word again. If the speech control unit 2 detects that the user utters a
speech command within said time limit, the speech control unit takes all
the measures presented above in the description of the advantageous
embodiment of the invention, i.e. forming the feature vectors and storing
them in the speech response memory 4, calculating the probabilities and a
new confidence value. Next, a new comparison is made between the confidence
value and the first threshold value Y. If the confidence value is greater
than or equal to the first threshold value Y, the speech control unit 2
interprets that the speech command was recognized correctly, wherein the
speech control unit 2 converts the speech command into the corresponding
control signal and transmits it to the control block 16. However, if the
confidence value is smaller than the first threshold value Y, the speech
control unit 2 makes a comparison between the feature vectors of the first
and second words uttered and stored in the speech response memory 4. This
comparison involves first time warping of the feature vectors and then
calculating the distance between the words, as described above in
connection with the description of the method. On the basis of these
calculated distances, the speech control unit 2 deduces whether the words
uttered by the user were the same or different words. If they were
different words, the speech control unit 2 did not recognize the command
word and neither will it form a control signal. If the command words were
probably the same, the speech control unit 2 converts the command word to
the corresponding control signal.

[0024] In case the first command word was not recognized
with sufficient confidence and the user does not
repeat the command word within the time limit, the command
word is not accepted. In this case, too, no control
signal is transmitted to the control block 16.
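
The decision flow of paragraphs [0022] to [0024] can be summarized in a minimal sketch. It is an illustration only, not the implementation of the speech control unit 2. The names `Recognition`, `decide`, `listen_again` and `same_word` are hypothetical, and rejecting the command outright when the first confidence value is below the second threshold value A is an assumption read from claim 3.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Recognition:
    """Result of one recognition stage (hypothetical container)."""
    word: str             # command word with the greatest probability
    confidence: float     # confidence value calculated for that word
    features: Sequence    # feature vectors stored for the utterance


def decide(
    first: Recognition,
    threshold_y: float,
    threshold_a: float,
    listen_again: Callable[[], Optional[Recognition]],
    same_word: Callable[[Sequence, Sequence], bool],
) -> Optional[str]:
    """Return the accepted command word, or None if no control signal is formed."""
    # First stage: accept directly if the confidence reaches threshold Y.
    if first.confidence >= threshold_y:
        return first.word
    # Assumption: below threshold A the time limit is not extended at all.
    if first.confidence < threshold_a:
        return None
    # Second stage: the time limit is extended and a repetition is awaited.
    # listen_again() returns None if the user stays silent within the limit.
    second = listen_again()
    if second is None:
        return None
    if second.confidence >= threshold_y:
        return second.word
    # Comparison stage: accept only if both utterances are probably the same word.
    if same_word(first.features, second.features):
        return second.word
    return None
```

A caller would pass the result of the first recognition stage together with the two threshold values, and would form a control signal only when the returned value is not `None`.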
[0025] To increase the convenience of use of the wire- 
less communication device 1 in those cases where the 
first recognition of the command word did not provide a 
sufficiently reliable recognition, the user can be in- 
formed of the failure of the recognition of the first stage 
and be requested to utter the command word again. The 
wireless communication device 1 forms e.g. an audio 
message with a speech synthesizer 8 and/or a visual 
message on a display means 13. The wireless communication
device 1 can inform the user with an audio and/
or visual signal also in a situation where the recognition 
was successful. Thus it will not remain obscure to the 
user whether the recognition was successful or not. This 
is particularly useful under noisy use conditions. 
[0026] Warping of feature vectors and calculating of
distances between words are prior art known per se, and
they are therefore not described here in more detail. It is obvious
that the present invention is not limited solely to the em- 
bodiments presented above but it can be modified within 
the scope of the appended claims. 
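
For orientation only, the following sketch shows one common way of time warping two feature-vector sequences and calculating a distance between them, namely dynamic time warping with a Euclidean local distance. The text does not specify which warping or distance measure is used, so the function names `dtw_distance` and `probably_same_word`, the length normalization and the `max_distance` threshold are all assumptions.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dtw_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Length-normalized dynamic time warping distance between two utterances."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both utterances need at least one feature vector")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    # Assumed normalization so that long and short words are comparable.
    return cost[n][m] / (n + m)


def probably_same_word(a: Sequence[Vector], b: Sequence[Vector],
                       max_distance: float) -> bool:
    """Assumed decision rule: same word if the warped distance is small enough."""
    return dtw_distance(a, b) <= max_distance
```

Two utterances of the same word spoken at different speeds then yield a small distance, because the warping absorbs the difference in timing, whereas different words yield a larger one.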



Claims 

1. A method for recognizing speech commands by using
a time window, which is extendable when needed,
in which method a group of command words selectable
by speech commands are defined, a time
window is defined, within which the recognition of 
the speech command is performed, and a first rec- 
ognition stage is performed, in which the recogni- 
tion result of the first recognition stage is selected, 
characterized in that further in the method: 

a) a first confidence value is determined for the 
recognition result of the first recognition stage, 

b) a first threshold value (Y) is determined, 

c) said first confidence value is compared with 
said first threshold value (Y), 

d) if said first confidence value is greater than 
or equal to said first threshold value (Y), the rec- 
ognition result of the first recognition stage is 
selected as the recognition result of the speech 
command, 

e) if said first confidence value is smaller than 
said first threshold value (Y), a second recog- 
nition stage is performed for the speech com- 
mand, wherein 

f) said time window is extended, and 

g) a second confidence value is determined for 
the recognition result of the second recognition 
stage, 

i) said second confidence value is compared
with said threshold value (Y),

j) if said second confidence value is greater
than or equal to said first threshold value (Y),
the command word selected at the second
stage is selected as the recognition result for
the speech command,

k) if said second confidence value is smaller
than said first threshold value (Y), a comparison
stage is performed, wherein

l) the first and second recognition results are
compared to find out at which probability they
are substantially the same, wherein if the probability
exceeds a predetermined value, the
command word selected at the second stage is
selected as the recognition result for the
speech command.

2. The method according to claim 1, characterized in
that at said recognition stages, a probability is determined
for one or several said command words,
at which the speech command uttered by the user
corresponds to said command word, wherein the
command word with the greatest determined probability
is selected as the recognition result of said
recognition stages.

3. The method according to claim 1 or 2, characterized
in that in the method an additional second
threshold value (A) is determined, wherein the stages
e) to k) are performed only if said first confidence
value is greater than said second threshold value
(A).

4. The method according to claim 3, characterized in
that the comparison stage k) is performed only if
said second confidence value is greater than said
second threshold value (A).

5. The method according to any of the claims 1 to 4,
characterized in that for determining the first confidence
value, a probability is determined for the first
speech command being background noise, wherein
the first confidence value is formed on the basis of
the probability determined for the command word
selected as the recognition result of the first recognition
stage, and the background noise probability.

6. The method according to any of the claims 1 to 5,
characterized in that for determining the second
confidence value, a probability is determined for the
second speech command being background noise,
wherein the second confidence value is formed on
the basis of the probability determined for the command
word selected as the recognition result of the
second recognition stage, and the background
noise probability.

7. A speech recognition device, in which a vocabulary
of selectable command words is defined, the device
comprising means (5) for measuring the time used
for recognition and comparing it with a predetermined
time window, and means (3, 4, 5) for selecting
a first recognition result, characterized in that
the speech recognition device comprises further:

means (3, 5) for calculating a first confidence
value for said first recognition result,

means (5) for comparing said first confidence
value with a predetermined first threshold value
(Y), wherein the recognition result of the first
recognition stage is arranged to be selected as
the recognition result of the speech command,
if said first confidence value is greater than or
equal to said first threshold value (Y), and

means (5) for performing the recognition stage
of a second speech command, if said first confidence
value is smaller than said first threshold
value (Y), the means for performing the recognition
stage of the second speech command
comprising:

means (5) for extending said time window,

means (3, 4, 5) for selecting the recognition result
of the second recognition stage,

means (5) for calculating a second confidence
value for said second recognition result,

means (5) for comparing said second confidence
value with the predetermined first
threshold value (Y), wherein the recognition result
of the second recognition stage is arranged
to be selected as the recognition result of the
speech command, if said second confidence
value is greater than or equal to said first
threshold value (Y), and

means (3, 4, 5) for performing a comparison
stage, the comparison stage being arranged to
be performed if said second confidence value
is smaller than said first threshold value (Y).

8. The speech recognition device according to claim
7, characterized in that the means for performing
the comparison stage comprise means (3, 4, 5) for
comparing the first and second recognition results.

9. A wireless communication device comprising
means for recognizing speech commands, in which
a vocabulary of selectable command words is defined,
the means for recognizing speech commands
comprising means (5) for measuring the time used
for recognition and comparing it with a predetermined
time window, and means (3, 4, 5) for selecting
a first recognition result, characterized in that
the means for recognizing speech commands comprise
further:

means (3, 5) for calculating a first confidence
value for said first recognition result,

means (5) for comparing said first confidence
value with a predetermined first threshold value
(Y), wherein the recognition result of the first
recognition stage is arranged to be selected as
the recognition result of the speech command,
if said first confidence value is greater than or
equal to said first threshold value (Y), and

means (5) for performing the recognition stage
of a second speech command, if said first confidence
value is smaller than said first threshold
value (Y), the means for performing the recognition
stage of the second speech command
comprising:

means (5) for extending said time window,

means (3, 4, 5) for selecting the recognition result
of the second recognition stage,

means (5) for calculating a second confidence
value for said second recognition result,

means (5) for comparing said second confidence
value with the predetermined first
threshold value (Y), wherein the recognition result
of the second recognition stage is arranged
to be selected as the recognition result of the
speech command, if said second confidence
value is greater than or equal to said first
threshold value (Y), and

means (3, 4, 5) for performing a comparison
stage, the comparison stage being arranged to
be performed if said second confidence value
is smaller than said first threshold value (Y).